Marrying Universal Dependencies and Universal Morphology
The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects
each present schemata for annotating the morphosyntactic details of language.
Each project also provides corpora of annotated text in many languages - UD at
the token level and UniMorph at the type level. As each corpus is built by
different annotators, language-specific decisions hinder the goal of universal
schemata. With compatibility of tags, each project's annotations could be used
to validate the other's. Additionally, the availability of both type- and
token-level resources would be a boon to tasks such as parsing and homograph
disambiguation. To ease this interoperability, we present a deterministic
mapping from Universal Dependencies v2 features into the UniMorph schema. We
validate our approach by lookup in the UniMorph corpora and find a
macro-average of 64.13% recall. We also note incompatibilities due to paucity
of data on either side. Finally, we present a critical evaluation of the
foundations, strengths, and weaknesses of the two annotation projects.

Comment: UDW1
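The mapping and its validation can be illustrated with a minimal sketch. The tag table below is an invented subset for illustration only (the real mapping covers the full UD v2 feature inventory), and the lookup counts are toy data, not the paper's results.

```python
# Hypothetical subset of a deterministic UD v2 -> UniMorph feature mapping.
# These entries are illustrative assumptions, not the full published table.
UD_TO_UNIMORPH = {
    "Number=Sing": "SG",
    "Number=Plur": "PL",
    "Tense=Past": "PST",
    "Tense=Pres": "PRS",
    "Case=Nom": "NOM",
}

def convert(ud_feats):
    """Map a UD feature string like 'Number=Sing|Tense=Past' to a UniMorph tag set."""
    tags = set()
    for feat in ud_feats.split("|"):
        if feat in UD_TO_UNIMORPH:
            tags.add(UD_TO_UNIMORPH[feat])
    return tags

def macro_average_recall(per_language_hits):
    """per_language_hits: {language: (found, total)} counts from UniMorph lookup.

    Macro-averaging weights every language equally, regardless of corpus size.
    """
    recalls = [found / total for found, total in per_language_hits.values() if total]
    return sum(recalls) / len(recalls)

# Toy validation: how many converted tag sets matched a UniMorph lexicon entry?
hits = {"en": (3, 4), "de": (1, 2)}
print(macro_average_recall(hits))  # 0.625
```

Macro-averaging (rather than pooling all lookups) is what makes the reported 64.13% reflect per-language performance rather than being dominated by the largest corpora.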
Long-Form Speech Translation through Segmentation with Finite-State Decoding Constraints on Large Language Models
One challenge in speech translation is that plenty of spoken content is
long-form, but short units are necessary for obtaining high-quality
translations. To address this mismatch, we adapt large language models (LLMs)
to split long ASR transcripts into segments that can be independently
translated so as to maximize the overall translation quality. We overcome the
tendency of hallucination in LLMs by incorporating finite-state constraints
during decoding; these eliminate invalid outputs without requiring additional
training. We discover that LLMs are adaptable to transcripts containing ASR
errors through prompt-tuning or fine-tuning. Relative to a state-of-the-art
automatic punctuation baseline, our best LLM improves the average BLEU by 2.9
points for English-German, English-Spanish, and English-Arabic TED talk
translation in 9 test sets, just by improving segmentation.

Comment: accepted to the Findings of EMNLP 2023. arXiv admin note: text
overlap with arXiv:2212.0989
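The idea of eliminating hallucinated outputs with finite-state decoding constraints can be sketched as follows. This is not the paper's implementation: the boundary token, the toy scorer, and greedy search are all assumptions chosen to keep the example small. The constraint enforced is that the model may only emit the next transcript token or a segment-boundary marker, so every valid output is the original transcript with breaks inserted.

```python
# A minimal sketch of finite-state decoding constraints: at each step the only
# valid outputs are the next transcript token or a boundary marker <brk>, so a
# constrained decoder cannot hallucinate words. BOUNDARY and toy_score are
# illustrative assumptions, not the paper's actual setup.

BOUNDARY = "<brk>"

def allowed_tokens(transcript, emitted):
    """Return the set of tokens the constrained decoder may emit next."""
    copied = [t for t in emitted if t != BOUNDARY]
    if len(copied) == len(transcript):
        return set()          # transcript fully copied: decoding is done
    nxt = {transcript[len(copied)]}
    if emitted and emitted[-1] != BOUNDARY:
        nxt.add(BOUNDARY)     # no empty segments: forbid leading/double breaks
    return nxt

def greedy_constrained_decode(transcript, score):
    """score(prefix, token) is a stand-in for the LLM's next-token score."""
    out = []
    while True:
        options = allowed_tokens(transcript, out)
        if not options:
            return out
        out.append(max(options, key=lambda t: score(out, t)))

# Toy scorer that prefers a break after sentence-final punctuation.
def toy_score(prefix, token):
    if token == BOUNDARY:
        return 1.0 if prefix and prefix[-1].endswith(".") else -1.0
    return 0.0

words = "i went home . it was late .".split()
print(greedy_constrained_decode(words, toy_score))
# ['i', 'went', 'home', '.', '<brk>', 'it', 'was', 'late', '.']
```

Because invalid continuations are masked out rather than penalized, no additional training is needed: any decoder, prompt-tuned or fine-tuned, produces only well-formed segmentations.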
Meaning to Form: Measuring Systematicity as Information
A longstanding debate in semiotics centers on the relationship between
linguistic signs and their corresponding semantics: is there an arbitrary
relationship between a word form and its meaning, or does some systematic
phenomenon pervade? For instance, does the character bigram \textit{gl} have
any systematic relationship to the meaning of words like \textit{glisten},
\textit{gleam} and \textit{glow}? In this work, we offer a holistic
quantification of the systematicity of the sign using mutual information and
recurrent neural networks. We employ these in a data-driven and massively
multilingual approach to the question, examining 106 languages. We find a
statistically significant reduction in entropy when modeling a word form
conditioned on its semantic representation. Encouragingly, we also recover
well-attested English examples of systematic affixes. We conclude with the
meta-point: Our approximate effect size (measured in bits) is quite
small---despite some amount of systematicity between form and meaning, an
arbitrary relationship and its resulting benefits dominate human language.

Comment: Accepted for publication at ACL 201
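The core quantity, entropy reduction in bits from conditioning the form model on meaning, can be sketched numerically. The per-character probabilities below are invented stand-ins for the outputs of the unconditional and semantics-conditioned recurrent models; only the arithmetic is faithful to the measure.

```python
import math

# A minimal sketch of systematicity as information: the drop in surprisal
# (bits) when a word-form model is conditioned on the word's meaning.
# The probabilities are toy values, not real model outputs.

def surprisal_bits(char_probs):
    """Total surprisal of one word form, -sum(log2 p), under some model."""
    return -sum(math.log2(p) for p in char_probs)

def mutual_information(unconditional, conditioned):
    """Estimate H(W) - H(W | meaning), averaged over word forms, in bits."""
    diffs = [surprisal_bits(u) - surprisal_bits(c)
             for u, c in zip(unconditional, conditioned)]
    return sum(diffs) / len(diffs)

# Toy data: conditioning on meaning makes each character more predictable.
uncond = [[0.25, 0.25, 0.5], [0.125, 0.5]]
cond   = [[0.5, 0.5, 0.5],   [0.25, 0.5]]
print(mutual_information(uncond, cond))  # 1.5
```

A positive value means meaning carries information about form; the paper's point is that the true effect size, while statistically significant across 106 languages, is only a small number of bits.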
Weird inflects but OK: Making sense of morphological generation errors
We conduct a manual error analysis of the CoNLL-SIGMORPHON 2017 Shared Task
on Morphological Reinflection. In this task, systems are given a word in
citation form (e.g., hug) and asked to produce the corresponding inflected
form (e.g., the simple past hugged). This design lets us analyze errors much
like we might analyze children's production errors. We propose an error
taxonomy and use it to annotate errors made by the top two systems across
twelve languages. Many of the observed errors are related to inflectional
patterns sensitive to inherent linguistic properties such as animacy or
affect; many others are failures to predict truly unpredictable inflectional
behaviors. We also find that nearly one quarter of the residual "errors"
reflect errors in the gold data. © 2019 Association for Computational
Linguistics. Peer reviewed